 segmentation mask


Dyn-O: Building Structured World Models with Object-Centric Representations

Neural Information Processing Systems

World models aim to capture the dynamics of the environment, enabling agents to predict and plan for future states. In most scenarios of interest, the dynamics are highly centered on interactions among objects within the environment. This motivates the development of world models that operate on object-centric rather than monolithic representations, with the goal of more effectively capturing environment dynamics and enhancing compositional generalization. However, the development of object-centric world models has largely been explored in environments with limited visual complexity (such as basic geometries). It remains underexplored whether such models can be effective in more challenging settings. In this paper, we fill this gap by introducing Dyn-O, an enhanced structured world model built upon object-centric representations. Compared to prior work in object-centric representations, Dyn-O improves in both learning representations and modeling dynamics. On the challenging Procgen games, we demonstrate that our method can learn object-centric world models directly from pixel observations, outperforming DreamerV3 in rollout prediction accuracy. Furthermore, by decoupling object-centric features into dynamic-agnostic and dynamic-aware components, we enable finer-grained manipulation of these features and generate more diverse imagined trajectories.


PolypSense3D: A Multi-Source Benchmark Dataset for Depth-Aware Polyp Size Measurement in Endoscopy

Neural Information Processing Systems

Accurate polyp sizing during endoscopy is crucial for cancer risk assessment but is hindered by subjective methods and inadequate datasets lacking integrated 2D appearance, 3D structure, and real-world size information. We introduce PolypSense3D, the first multi-source benchmark dataset specifically targeting depth-aware polyp size measurement. It uniquely integrates over 43,000 frames from virtual simulations, physical phantoms, and clinical sequences, providing synchronized RGB, dense/sparse depth, segmentation masks, camera parameters, and millimeter-scale size labels derived via a novel forceps-assisted in-vivo annotation technique. To establish its value, we benchmark state-of-the-art segmentation and depth estimation models. Results quantify significant domain gaps between simulated/phantom and clinical data and reveal substantial error propagation from perception stages to final size estimation, with the best fully automated pipelines achieving an average Mean Absolute Error (MAE) of 0.95 mm on the clinical data subset. Publicly released under CC BY-SA 4.0 with code and evaluation protocols, PolypSense3D offers a standardized platform to accelerate research in robust, clinically relevant quantitative endoscopic vision.


Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders

Neural Information Processing Systems

The introduction of generative models has significantly advanced image super-resolution (SR) in handling real-world degradations. However, they often incur fidelity-related issues, particularly distorting textual structures. In this paper, we introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders to recover not only natural details but also the structural fidelity of text regions in degraded real-world images. Moreover, we propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks, combining realistic foreground text regions with detailed background content. Extensive experiments demonstrate that our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics and exhibiting strong generalization to real-world scenarios. Our code is available here.


PanCap Joint Panoptic Segmentation and Grounded Captions for Fine Understanding and Generation

Neural Information Processing Systems

This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. It establishes a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.


ShapeEmbed: a self-supervised learning framework for 2D contour quantification

Neural Information Processing Systems

The shape of objects is an important source of visual information in a wide range of applications. One of the core challenges of shape quantification is to ensure that the extracted measurements remain invariant to transformations that preserve an object's intrinsic geometry, such as changing its size, orientation, and position in the image. In this work, we introduce ShapeEmbed, a self-supervised representation learning framework designed to encode the contour of objects in 2D images, represented as a Euclidean distance matrix, into a shape descriptor that is invariant to translation, scaling, rotation, reflection, and point indexing. Our approach overcomes the limitations of traditional shape descriptors while improving upon existing state-of-the-art autoencoder-based approaches. We demonstrate that the descriptors learned by our framework outperform their competitors in shape classification tasks on natural and biological images. We envision our approach to be of particular relevance to biological imaging applications.
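The invariances named in the ShapeEmbed abstract can be illustrated with a small sketch. It only shows why a Euclidean distance matrix is a convenient input; it is not the ShapeEmbed model. The helper names, the toy contour, and the choice to cancel scale by dividing by the largest distance are assumptions made for illustration.

```python
import math


def distance_matrix(points):
    """Pairwise Euclidean distances between 2D contour points."""
    return [[math.dist(p, q) for q in points] for p in points]


def normalize(matrix):
    """Divide by the largest distance so uniform scaling cancels out.

    This normalization is an illustrative assumption, not taken from the paper.
    """
    peak = max(max(row) for row in matrix)
    return [[value / peak for value in row] for row in matrix]


def transform(points, angle=0.0, scale=1.0, shift=(0.0, 0.0), reflect=False):
    """Apply an optional reflection, then rotation, scaling, and translation."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    moved = []
    for x, y in points:
        if reflect:
            x = -x
        moved.append((scale * (cos_a * x - sin_a * y) + shift[0],
                      scale * (sin_a * x + cos_a * y) + shift[1]))
    return moved


def max_abs_diff(a, b):
    """Largest entry-wise difference between two equally sized matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


# Hypothetical five-point contour.
contour = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.5), (0.0, 1.0)]
reference = normalize(distance_matrix(contour))

# Same shape after reflection, rotation, scaling, and translation.
moved = transform(contour, angle=0.7, scale=3.0, shift=(5.0, -2.0), reflect=True)
same_shape = normalize(distance_matrix(moved))

# Same shape with a different starting point along the contour.
reindexed = normalize(distance_matrix(contour[2:] + contour[:2]))

print("rigid motion + scale:", max_abs_diff(reference, same_shape))
print("re-indexing:", max_abs_diff(reference, reindexed))
```

The first difference is zero up to floating-point error, so the normalized matrix already absorbs translation, rotation, reflection, and scale. The second is not: re-indexing permutes rows and columns, which is the remaining invariance a learned descriptor has to provide.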


ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts

Neural Information Processing Systems

Dataset bias, where data points are skewed to certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained on representations from vision foundation models. ConceptScope categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels, enabling class-level dataset characterization, bias identification, and robustness evaluation through concept-based subgrouping.


0266e33d3f546cb5436a10798e657d97-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their encouraging and constructive comments. We are pleased that they find the paper well written and acknowledge the novelty and originality of the proposed task, which "has a potential to spark interest" (R1) and "may lead to future papers studying it" (R2). Regarding the proposed framework, R1 and R2 not only find it "sound" and "novel" but also stress the "re-implementation ease" from which "practitioners may benefit" (R1). Still, the reviewers raise points of improvement (R1, R3) and suggest a discussion about a related task (R2). We carefully address these comments below.


Supplementary Materials: AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation

Neural Information Processing Systems

The series is directed by David Yates and distributed by Warner Bros. It consists of three fantasy films as of 2022: Fantastic Beasts and Where to Find Them (2016) [1]. The movie follows Newt Scamander, a magizoologist who travels to New York with a suitcase full of magical creatures. When some of the creatures escape, he teams up with a group of people to find them before they cause any harm.


details

Neural Information Processing Systems

A.1 MONet

To segment each $w \times h$ frame $F_t$ into $N_o$ object representations, MONet uses a recurrent attention network to obtain $N_o$ attention masks $A_{ti} \in [0,1]^{w \times h}$ for $i = 1, \dots, N_o$ that represent the probability of each pixel in $F_t$ belonging to the $i$-th object, with $\sum_{i=1}^{N_o} A_{ti} = 1$. This attention network is coupled with a component VAE with latents $z_{ti} \in \mathbb{R}^d$ for $i = 1, \dots, N_o$ that reconstructs $A_{ti} \odot F_t$, the $i$-th object in the image. The latent posterior distribution $q(z_{ti} \mid F_t, A_{ti})$ is a diagonal Gaussian with mean $\mu_{ti}$, and we use $\mu_{ti}$ as the representation of the $i$-th object. When these representations are fed into the transformer, we use a linear projection to map the raw object/word embeddings, which lie in $\mathbb{R}^d$, to a vector in $\mathbb{R}^{d N_H}$, where $N_H$ is the number of self-attention heads. This step is necessary as generally the latent dimensionality of MONet, $d$, is less than $N_H$, whereas a transformer expects the embedding size to be divisible by $N_H$.

A.2 Self-supervised training

Recall in the main text that we wrote the auxiliary self-supervised loss as $\text{auxiliary loss} = \sum \dots$ A comparison of these losses and the masking schemes is given in Figure 4. We also tested a few variations of the contrastive loss inspired by literature and tested all combinations of variations.
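The two shape constraints in the MONet description (masks that act as per-pixel probabilities over objects, and an embedding size divisible by the number of heads) can be checked with a toy sketch. This is not MONet: a plain softmax over random logits stands in for the recurrent attention network, the projection weights are random rather than learned, and all sizes and function names are illustrative assumptions.

```python
import math
import random


def softmax_masks(logits):
    """Turn logits[i][p] (slot i, pixel p) into masks that sum to 1 at each pixel.

    Stand-in for MONet's recurrent attention network (assumption).
    """
    n_slots, n_pixels = len(logits), len(logits[0])
    masks = [[0.0] * n_pixels for _ in range(n_slots)]
    for p in range(n_pixels):
        column = [logits[i][p] for i in range(n_slots)]
        top = max(column)  # subtract the max for numerical stability
        exps = [math.exp(value - top) for value in column]
        total = sum(exps)
        for i in range(n_slots):
            masks[i][p] = exps[i] / total
    return masks


def linear_projection(vector, weights):
    """Multiply a vector by a weight matrix of shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


rng = random.Random(0)
N_O, W, H = 3, 4, 2   # objects and frame size (illustrative values)
D, N_H = 5, 8         # latent size smaller than the number of heads

# Attention masks: one flattened w*h mask per object.
logits = [[rng.gauss(0.0, 1.0) for _ in range(W * H)] for _ in range(N_O)]
masks = softmax_masks(logits)
pixel_sums = [sum(masks[i][p] for i in range(N_O)) for p in range(W * H)]

# Project one object representation from R^d to R^(d * N_H).
mu = [rng.gauss(0.0, 1.0) for _ in range(D)]
weights = [[rng.gauss(0.0, 1.0 / math.sqrt(D)) for _ in range(D)]
           for _ in range(D * N_H)]
token = linear_projection(mu, weights)

# The projected size splits evenly across the attention heads.
head_dim = len(token) // N_H
heads = [token[h * head_dim:(h + 1) * head_dim] for h in range(N_H)]

print("mask sums per pixel:", [round(s, 6) for s in pixel_sums])
print("token size:", len(token), "per-head size:", head_dim)
```

With $d = 5$ and $N_H = 8$ the raw latent cannot be split across heads, while the projected vector of size $d N_H = 40$ divides evenly into eight slices of size $d$.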


TextDiffuser: Diffusion Models as Text Painters

Neural Information Processing Systems

Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality.